library("data.table")
library("ggplot2")
library("tidyverse")
library("MOFA2")Applying MOFA to the Chronic Lymphocytic Leukemia cohort
Introduction
The data consist of \(N=200\) blood samples from a cohort of Chronic Lymphocytic Leukemia (CLL) patients, where four omics data types were profiled: DNA methylation (450K Illumina microarray), bulk RNA-seq, somatic mutations and ex-vivo drug response assay. The dataset was introduced in detail by Dietrich et al. (2018) and can be downloaded here. Its MOFA analysis was originally published by Argelaguet et al. (2018), and refined by Lu et al. (2021).
Load packages
Make sure that you have installed the MOFA2 and the MOFAdata package.
Load data
The data are stored as a list of matrices, with features in the rows and samples in the columns:
data("CLL_data", package = "MOFAdata")
sapply(CLL_data, dim) Drugs Methylation mRNA Mutations
[1,] 310 4248 5000 69
[2,] 200 200 200 200
mRNA expression
The mRNA expression were normalised by library size, followed by a variance stabilizing transformation using DESeq2:
hist(CLL_data$mRNA, breaks = 50)DNA methylation
DNA methylation was calculated for every CpG site using the M-value, which provides a better summary statistic for downstream analysis. For the MOFA analysis we selected the top 1% (\(N=4248\)) most variable sites.
hist(CLL_data$Methylation, breaks = 50)Drug response
The authors measured the effect of multiple drugs ex vivo using a high-throughput platform. For each drug they used 5 concentrations. The value reported is the viability score (0=all cells died, 1=no cells died).
hist(CLL_data$Drugs, breaks = 50)Somatic mutations
Mutations were assessed using a panel of common cancer mutations and are summarised in a binary format (0=no mutation, 1=mutaton):
table(CLL_data$Mutations)
0 1
8474 667
Sample metadata
Load sample metadata as a data.frame. Important columns are:
- Gender: m (male), f (female)
- Age: age in years
- TTT: time (in years) which passed from taking the sample to the next treatment
- TTD: time (in years) which passed from taking the sample to patients’ death
- treatedAfter: (TRUE/FALSE)
- died: whether the patient died (TRUE/FALSE)
CLL_metadata = fread("http://ftp.ebi.ac.uk/pub/databases/mofa/cll_vignette/sample_metadata.txt")
head(CLL_metadata) |> knitr::kable()| sample | Gender | age | TTT | TTD | treatedAfter | died | IGHV | trisomy12 |
|---|---|---|---|---|---|---|---|---|
| H005 | m | 75.26575 | 0.5749487 | 2.625599 | TRUE | FALSE | 1 | 0 |
| H006 | m | NA | NA | NA | NA | NA | NA | NA |
| H007 | f | NA | NA | NA | NA | NA | NA | NA |
| H008 | m | NA | NA | NA | NA | NA | NA | NA |
| H010 | f | 72.78082 | 2.9322382 | 2.932238 | FALSE | FALSE | 0 | 0 |
| H011 | f | 72.99452 | 0.0191650 | 2.951403 | TRUE | FALSE | 1 | 0 |
For some of the correlation analysis in the following, we want to treat sex and died as numeric variables, and hence we create these here.
CLL_metadata = mutate(CLL_metadata,
sex = as.numeric(as.factor(Gender)),
diedn = as.numeric(died))Create the MOFA object and train the model
Create the MOFA object
MOFAobject = create_mofa(CLL_data)
MOFAobjectUntrained MOFA model with the following characteristics:
Number of views: 4
Views names: Drugs Methylation mRNA Mutations
Number of features (per view): 310 4248 5000 69
Number of groups: 1
Groups names: group1
Number of samples (per group): 200
Plot data overview
Visualise the number of views (rows) and the number of groups (columns) exist, what are their corresponding dimensionalities and how many missing information they have (grey bars).
plot_data_overview(MOFAobject)Define MOFA options
Model options
Two important options:
- num_factors: number of factors
- likelihoods: likelihood per view (options are “gaussian”, “poisson”, “bernoulli”). By default the “gaussian” distribution is used. When having binary data, as is the case for Somatic mutations, one should change the likelihood to “bernoulli”:
model_opts = get_default_model_options(MOFAobject)
model_opts$likelihoods["Mutations"] = "bernoulli"
model_opts$num_factors = 15
model_opts$likelihoods
Drugs Methylation mRNA Mutations
"gaussian" "gaussian" "gaussian" "bernoulli"
$num_factors
[1] 15
$spikeslab_factors
[1] FALSE
$spikeslab_weights
[1] TRUE
$ard_factors
[1] FALSE
$ard_weights
[1] TRUE
Train the MOFA model
Prepare the MOFA object
MOFAobject = prepare_mofa(MOFAobject, model_options = model_opts)Train the model: this should take ~5min.
MOFAobject = run_mofa(MOFAobject)Add sample metadata to the model
The sample metadata must be provided as a data.frame and it must contain a column sample with the sample IDs. Make sure that the samples in the metadata match the samples in the model
stopifnot(CLL_metadata$sample %in% samples_metadata(MOFAobject)$sample)
samples_metadata(MOFAobject) = CLL_metadataRename features
Just run the below code and do not worry about the details, it is just tedious identifier reshuffling. We keep the model with the original variable names for the gene set enrichment analysis section
MOFAobject.ensembl = MOFAobjectupdated_features_names = features_names(MOFAobject)
# Rename drug IDs (i.e. D_001) to drug names (i.e. navitoclax)
drug_metadata = fread("http://ftp.ebi.ac.uk/pub/databases/mofa/cll_vignette/drugs.txt.gz")
tmp = drug_metadata$name; names(tmp) = drug_metadata$drug_id
updated_features_names[["Drugs"]] = stringr::str_replace_all(features_names(MOFAobject)[["Drugs"]], tmp)
# Rename mRNA from ENSEMBL IDs (i.e. ENSG00000223972) to gene names (i.e. DDX11L1)
gene_metadata = fread("http://ftp.ebi.ac.uk/pub/databases/mofa/cll_vignette/Hsapiens_genes_BioMart.87.txt.gz")
gene_metadata[,symbol:=ifelse(symbol=="",ens_id,symbol)]
tmp = gene_metadata$symbol; names(tmp) = gene_metadata$ens_id
# avoid duplicated names with the Mutations view
tmp[tmp%in%features_names(MOFAobject)[["Mutations"]]] = paste0(tmp[tmp%in%features_names(MOFAobject)[["Mutations"]]],"_mRNA")
updated_features_names[["mRNA"]] = stringr::str_replace_all(features_names(MOFAobject)[["mRNA"]], tmp)
# Update features names in model
features_names(MOFAobject) = updated_features_namesVariance decomposition analysis
Variance decomposition by Factor
The most important insight that MOFA generates is the variance decomposition analysis. This plot shows the percentage of variance explained by each factor across each data modality.
plot_variance_explained(MOFAobject, max_r2=10)What insights from the data can we learn just from inspecting this plot?
- Factor 1, Factor 2 and Factor 3 each captures a source of variability that is present in more than 3 data modalities. Thus, its etiology is likely to be something very important for the disease.
- Factor 4 captures some co-variation between the mRNA and the drug response assay.
- Factor 5 captures a strong source of variation that is exclusive to the mRNA data.
(Q) Based on the MOFA output, if you were to profile just one molecular layer, which one would you choose to maximise the amount of sources of variation captured?
Characterisation of Factor 1
There are a few systematic strategies to characterise the molecular signal that underlies each MOFA Factor and to relate them to existent sample covariates:
- Association analysis between the sample metadata and the Factor values.
- Inspection of factor values.
- Inspection of the feature weights.
- Gene set enrichment analysis on the mRNA weights.
Association analysis
Let’s test for associations between the MOFA factors and some of the covariates:
correlate_factors_with_covariates(MOFAobject,
covariates = c("sex", "age", "diedn"),
plot = "log_pval"
)Several factors showe associations with survival outcome (whether the patients were deceased). We will explore association with clinical measurements later in the tutorial.
Inspection of factor values
How do we interpret the factor values?
Each factor captures a different source of variability in the data. Mathematically, each Factor is defined by a linear combination of the input features. Each Factor ordinates cells along a one-dimensional axis that is centered at zero. Samples with different signs manifest opposite phenotypes along the inferred axis of variation, with higher absolute value indicating a stronger effect.
Note that the interpretation of MOFA factors is analogous to the interpretation of the principal components in PCA.
plot_factors(MOFAobject,
factors = c(1,2),
dot_size = 2.5
)Inspection of feature weights
How do we interpret the feature weights?
The weights provide a score for each feature on each factor. Features with no association with the corresponding factor are expected to have values close to zero, whereas features with strong association with the factor are expected to have large absolute values. The sign of the weights indicates the direction of the effect: a positive weights indicates that the feature has higher levels in the cells with positive factor values, and vice-versa.
Plot feature weights for somatic mutations
By looking at the variance explained plot, we saw that Factor 1 captures variation in all data modalities. Out of all omics, the somatic mutation data is a good place to start, as somatic mutations are very sparse, easy to interpret and any change in the DNA is likely to have downstream consequences to all other molecular layers. Let’s plot the weights:
plot_weights(MOFAobject,
view = "Mutations",
factor = 1,
nfeatures = 10, # Top number of features to highlight
scale = T # Scale weights from -1 to 1
)Notice that most features lie at zero, indicating that most features have no association with Factor 1. There is however one gene that clearly stands out: IGHV (immunoglobulin heavy chain variable region). This is the main clinical marker for CLL.
An alternative visualisation to the full distribution of weights is to do a line plot that displays only the top features with the corresponding weight sign on the right:
plot_top_weights(MOFAobject,
view = "Mutations",
factor = 1,
nfeatures = 10, # Top number of features to highlight
scale = T # Scale weights from -1 to 1
)IGHV has a positve weight. This means that samples with positive Factor 1 values have the IGHV mutation whereas samples with negative Factor 1 values do not have the IGHV mutation. To confirm this, let’s plot the Factor values and colour the IGHV mutation status.
plot_factor(MOFAobject,
factors = 1,
color_by = "IGHV",
add_violin = TRUE,
dodge = TRUE,
show_missing = FALSE
)We can also plot Factor values coloured by other covariates, for example Gender. As concluded from the association analysis above, this variable has no association with Factor 1:
plot_factor(MOFAobject,
factors = 1,
color_by = "Gender",
dodge = TRUE,
add_violin = TRUE
)Plot gene weights for mRNA expression
From the variance explained plot we know that Factor 1 drives variation across all data modalities. Let’s visualise the mRNA expression changes that are associated with Factor 1:
plot_weights(MOFAobject,
view = "mRNA",
factor = 1,
nfeatures = 10
)Plot molecular signatures in the input data
In this case we have a large amount of genes that have large positive and negative weights. Genes with large positive values will be more expressed in the samples with IGHV mutation, whereas genes with large negative values will be more expressed in the samples without the IGHV mutation. Let’s verify this. The function plot_data_scatter generates a scatterplot of Factor 1 values (x-axis) versus expression values (y-axis) for the top 4 genes with largest positive weight. Samples are coloured by IGHV status:
plot_data_scatter(MOFAobject,
view = "mRNA",
factor = 1,
features = 4,
sign = "negative",
color_by = "IGHV"
) + labs(y="RNA expression")An alternative visualisation is to use a heatmap
plot_data_heatmap(MOFAobject,
view = "mRNA",
factor = 1,
features = 25,
cluster_rows = FALSE, cluster_cols = FALSE,
show_rownames = TRUE, show_colnames = FALSE,
scale = "row"
)Prediction of individual markers for personalised treatment based on the patient’s IGHV status
(Q) Can you suggest new RNA expression and DNA methylation markers for personalised treatment recommendations according to Factor 1 (the IGHV status)?
First explore the MOFA weights, then go back to the input data and do boxplots for the chosen markers (x-axis being the IGHV status and y-axis being the marker’s expression or DNA methylation values). Hints: - The section Customized analysis may be helpful to extract the weights and the data in a long data.frame format - the IGHV status for each sample can be fetched from the CLL_metadata object - the ggpubr package provides very useful ggplot-based visualisations, including boxplots with p-values. I highly recommend it!
Characterisation of Factor 2
(Q) Your task is to provide a characterisation for Factor 2.
Try a similar pipeline as for Factor 1 and answer the following questions:
- Which mutation underlies Factor 2?
- Can you identify mRNA markers?
- Do a (small) bibliographical search to check if your predictions make sense
Inspection of combinations of Factors
Now that we have characterised the etiology of the two main Factors, let’s do a scatterplot colouring each patient by their somatic mutation profile:
p = plot_factors(MOFAobject,
factors = c(1,2),
color_by = "IGHV",
shape_by = "trisomy12",
dot_size = 2.5,
show_missing = T
)
p = p +
geom_hline(yintercept=-1, linetype="dashed") +
geom_vline(xintercept=(-0.5), linetype="dashed")
print(p)Warning: Removed 27 rows containing missing values (`geom_point()`).
Prediction of clinical subgroups
The scatterplot of Factor 1 vs Factor 2 reveals that a few samples are missing the somatic mutation status. In this case, the doctors were not able to classify patients into their clinical subgroups. But we can now use MOFA to exploit the molecular profiles and attempt to impute the IGHV and trisomy12 status.
library("randomForest")# Prepare data
df = as.data.frame(get_factors(MOFAobject, factors=c(1,2))[[1]])
# Train the model for IGHV after removing missing observations
df$IGHV = as.factor(samples_metadata(MOFAobject)$IGHV)
model.ighv = randomForest(IGHV ~ ., data=df[!is.na(df$IGHV),])
# Do predictions
samples_metadata(MOFAobject)$IGHV.pred = stats::predict(model.ighv, df)# Prepare data
df = as.data.frame(get_factors(MOFAobject, factors=c(1,2))[[1]])
# Train the model for Trisomy12 after removing missing observations
df$trisomy12 = as.factor(samples_metadata(MOFAobject)$trisomy12)
model.trisomy12 = randomForest(trisomy12 ~ ., data=df[!is.na(df$trisomy12),])
samples_metadata(MOFAobject)$trisomy12.pred = stats::predict(model.trisomy12, df)Plot predictions for IGHV
samples_metadata(MOFAobject)$IGHV.pred_logical = c("True","Predicted")[as.numeric(is.na(samples_metadata(MOFAobject)$IGHV))+1]
p = plot_factors(MOFAobject,
factors = c(1,2),
color_by = "IGHV.pred",
shape_by = "IGHV.pred_logical",
dot_size = 2.5,
show_missing = T
)
p = p +
geom_hline(yintercept=-1, linetype="dashed") +
geom_vline(xintercept=(-0.5), linetype="dashed")
print(p)Gene set enrichment analysis
In addition to exploring the individual weights for each factor, we can use enrichment analysis to look for signiificant associations of factors to genesets. Here, we use the Reactome genesets for illustrations, which is contained in the MOFAdata package. For more details on how the GSEA works we encourage the users to read the GSEA vignette
Load Reactome gene set annotations.
Gene set annotations are provided as a binary membership matrix, with genes in the rows, pathways in the columns. A value of 1 indicates that the corresponding gene belongs to the corresponding pathway.
data("reactomeGS", package = "MOFAdata")
head(colnames(reactomeGS))[1] "ENSG00000187634" "ENSG00000188976" "ENSG00000187961" "ENSG00000187583"
[5] "ENSG00000187642" "ENSG00000188290"
head(rownames(reactomeGS))[1] "Interleukin-6 signaling"
[2] "Apoptosis"
[3] "Hemostasis"
[4] "Intrinsic Pathway for Apoptosis"
[5] "Cleavage of Growing Transcript in the Termination Region "
[6] "PKB-mediated events"
Run enrichment analysis
These are the steps for doing Gene Set Enrichment Analysis (GSEA) with MOFA:
- (1) Define your gene set matrix: this can be specified as a binary matrix where rows are gene sets and columns are genes. A value of 1 indicates that gene
jbelongs to pathwayi. A value of 0 indicates elsewise.
- (2) Select a gene set statistic: the statistic used to quantify the scores at the pathway level. Must be one of the following:
mean.diff(difference in the average weight between foreground and background genes) orrank.sum(difference in the sum of ranks between foreground and background genes).
- (3) Select a statistical test: the statistical test used to compute the significance of the gene set statistics under a competitive null hypothesis. Must be one of the following:
parametric(a simple and very liberal parametric t-test),cor.adj.parametric(parametric t-test adjusted by the correlation between features),permutation(unparametric, the null distribution is created by permuting the weights. This option is computationally expensive, but it preserves the correlation structure between features in the data.).
enrichment.results = run_enrichment(
object = MOFAobject.ensembl,
view = "mRNA",
feature.sets = reactomeGS,
set.statistic = "mean.diff",
statistical.test = "parametric"
)The enrichment analysis returns a list of 5 elements:
- feature.sets: the feature set matrix filtered by the genes that overlap with the MOFA model.
- pval: the nominal p-values.
- pval.adj: the FDR-adjusted p-values.
- feature.statistics: the feature statistics (i.e. the weights).
- set.statistics: matrices with the gene set statistics.
- sigPathways: list with significant pathways per factor at a specified FDR threshold
names(enrichment.results)[1] "feature.sets" "pval" "pval.adj"
[4] "feature.statistics" "set.statistics" "sigPathways"
Plot enrichment analysis results
Plot an overview of the number of significant pathways per factor.
It seems that most of the Factors do not have clear gene set signatures. A clear exception is Factor 5, which has a very strong enrichment for genes with positive weights.
plot_enrichment_heatmap(enrichment.results)(Q) Can you characterise Factor 4 based on the GSEA results? Which genes are driving the top enriched pathways?
Hint: use the functions plot_enrichment
(Q) Which drugs are associated with Factor 4? What is their target pathway? Do they make biological sense?
Hint: use the drug_metadata object
Customized analysis
For customized exploration of weights and factors, you can directly fetch the variables from the model using get_weights and get_factors:
weights = get_weights(MOFAobject,
views = "all",
factors = "all",
as.data.frame = TRUE
)
head(weights) feature factor value view
1 navitoclax_1 Factor1 -8.043809e-05 Drugs
2 navitoclax_2 Factor1 -1.163131e-03 Drugs
3 navitoclax_3 Factor1 -2.162450e-02 Drugs
4 navitoclax_4 Factor1 -1.950353e-02 Drugs
5 navitoclax_5 Factor1 -1.093797e-02 Drugs
6 ibrutinib_1 Factor1 3.277712e-03 Drugs
factors = get_factors(MOFAobject,
factors = "all",
as.data.frame = TRUE
)
head(factors) sample factor value group
1 H045 Factor1 -2.210321 group1
2 H109 Factor1 -2.234613 group1
3 H024 Factor1 2.515173 group1
4 H056 Factor1 1.840702 group1
5 H079 Factor1 -1.917438 group1
6 H164 Factor1 -2.804674 group1
Building predictive models of clinical outcome
The factors inferred by MOFA can be related to clinical outcomes such as time to treatment or survival times. As this type of data is censored (not for all samples we have already observed the event) we will use Cox models for this purpose. In a Cox proportional hazards model we model the hazard of an event ocurring (e.g. death or treatment) as a function of some covariates (here the factors). If a factor has a influence on the surivival time or time to treatment it will receive a high absoulte coefficient in the Cox model. In particular:
- If the coefficient is positive, samples with large factor values have an increased hazard (of death or treatment) compared to samples with small factor values.
- If the coefficient is negative, samples with small factor values have an increased hazard compared to samples with a large factor values.
To fit these models we will use the coxph function in the survival package. The survival data is stored in a survival object that contains both the time a sample has been followed up and whether the event has occured (as 0,1).
Let’s take time to treatment as an example here. The sample metadata contains the follow-up times per sample in years in the column TTT, and the column treatedAfter indicated whether a treatment occured.
Fit Cox models
library("survival")
library("survminer")SurvObject = Surv(samples_metadata(MOFAobject)$TTT, samples_metadata(MOFAobject)$treatedAfter)
facs = get_factors(MOFAobject)[[1]]
fit = coxph(SurvObject ~ facs)
fitCall:
coxph(formula = SurvObject ~ facs)
coef exp(coef) se(coef) z p
facsFactor1 -0.28402 0.75275 0.06684 -4.249 2.14e-05
facsFactor2 -0.23325 0.79196 0.12775 -1.826 0.067873
facsFactor3 -0.22684 0.79705 0.09745 -2.328 0.019926
facsFactor4 0.41687 1.51720 0.10629 3.922 8.78e-05
facsFactor5 0.06287 1.06488 0.06619 0.950 0.342230
facsFactor6 0.09984 1.10499 0.10681 0.935 0.349923
facsFactor7 0.14934 1.16107 0.08638 1.729 0.083832
facsFactor8 0.37342 1.45269 0.08968 4.164 3.13e-05
facsFactor9 -0.53862 0.58355 0.10386 -5.186 2.15e-07
facsFactor10 0.00265 1.00265 0.09847 0.027 0.978527
facsFactor11 0.47884 1.61420 0.13197 3.628 0.000285
facsFactor12 -0.12444 0.88299 0.13887 -0.896 0.370195
facsFactor13 -0.11774 0.88893 0.10655 -1.105 0.269187
facsFactor14 -0.08792 0.91583 0.14405 -0.610 0.541629
facsFactor15 0.09182 1.09617 0.05942 1.545 0.122265
Likelihood ratio test=107.9 on 15 df, p=4.021e-16
n= 174, number of events= 96
(26 observations deleted due to missingness)
We can see that several factors have a significant association to time to treatment. For example, Factor 1 has a negative coefficient. Samples with low factor values have an increased hazard compared to samples with a large factor values.
Plot Hazard ratios
(Q) Which Factors are associated with the clinical covariate (time to next treatment)?
Extract p-values and cox model coefficients (i.e. hazard ratios )
s = summary(fit)
coef = s[["coefficients"]]
df = data.frame(
factor = factor(rownames(coef), levels = rev(rownames(coef))),
p = coef[,"Pr(>|z|)"],
coef = coef[,"exp(coef)"],
lower = s[["conf.int"]][,"lower .95"],
higher = s[["conf.int"]][,"upper .95"]
)Plot the Hazard ratio per factor, together with 95% confidence intervals
ggplot(df, aes(x=factor, y=coef, ymin=lower, ymax=higher, color=p<0.01)) +
geom_pointrange() +
coord_flip() +
scale_x_discrete() +
labs(y="Hazard Ratio", x="") +
geom_hline(aes(yintercept=1), linetype="dotted") +
theme_bw()Solutions
(Q) Based on the MOFA output, if you were to profile just one molecular layer, which one would you choose to maximise the amount of sources of variation captured?
By inspecting the variance explained plot, we can see that the RNA expression is capturing most of the sources of variation in this data set. There are only a few Factors that cannot be captured using RNA expression ( for example Factors 2 and 14). The ex vivo drug response assay also captures a lot of variability, but it is much harder to obtain from patient cohorts than RNA expression data. The other two data modalities are less informative: DNA methylation data is only active in Factor 1, 8 and 10 and Somatic mutations are only associated with Factors 1 and 3.
If were were to profile just one molecular layer for a large number of patients in a cost-effective way, we would need to compare the feasibility and costs of Drug response assays and RNA sequencing. The latter is much cheaper and more standarised in the community.
(Q) Can you suggest new RNA expression and DNA methylation markers for personalised treatment recommendations according to Factor 1 (the IGHV status)?
We first collect the genes with the largest weights for Factor 1. Then we can do boxplots stratifying samples by IGHV status, followed by statiscal testing with the null hypothesis that the average gene expression does not differ between groups (a simple t-test should work for a first exploration).
Extract mRNA weights from the MOFA object
rna.weights = get_weights(MOFAobject,
views = "mRNA",
factors = 1,
abs = TRUE, # we do not distinguish between direction of effect
as.data.frame = TRUE
)
# Extract top N genes
top.genes = rna.weights %>%
.[order(rna.weights$value, decreasing = T),] %>%
head(n=9) %>% .$feature %>% as.character
head(top.genes)[1] "ZNF667" "SEPT10" "SOWAHC" "ZNF471" "LPL" "ADAM29"
Fetch mRNA data from the MOFAobject for the top genes
rna.data = get_data(MOFAobject,
views = "mRNA",
as.data.frame = TRUE,
features = list("mRNA"=top.genes)
)
head(rna.data) view group feature sample value
1 mRNA group1 ZNF667 H045 10.323723
2 mRNA group1 SEPT10 H045 11.473792
3 mRNA group1 SOWAHC H045 10.134044
4 mRNA group1 ZNF471 H045 9.680574
5 mRNA group1 LPL H045 12.686458
6 mRNA group1 ADAM29 H045 7.129380
Add IGHV status from the sample metadata
to.plot = rna.data %>%
merge(CLL_metadata[,c("sample","IGHV")], by="sample")
colnames(to.plot)[1] "sample" "view" "group" "feature" "value" "IGHV"
(Optional) Remove samples with unknown IGHV status
to.plot = to.plot[!is.na(to.plot$IGHV),]Box plots with statistical comparison of means
ggpubr::ggboxplot(to.plot, x = "IGHV", y = "value", fill = "IGHV", facet="feature") +
stat_compare_means() +
labs(x="IGHV status", y="mRNA expression") +
theme(legend.position="none")(Q) Provide a characterisation for Factor 2
Following a similar strategy as for Factor 1, we notice that Factor 2 is also active in the somatic mutation view. Thus, there must be a mutation that underlies this phenotype. Let’s plot the corresponding weights:
plot_weights(MOFAobject,
view = "Mutations",
factor = 2,
nfeatures = 5,
abs = F
)In this case we have two mutations that have large weight. One of them is the trisomy of chromosome 12, which is the second most important clinical marker in CLL!
Let’s verify this by plotting the Factor values grouping samples by the presence or absence of trisomy12:
plot_factor(MOFAobject,
factors = 2,
color_by = "trisomy12",
dodge = TRUE,
add_violin = TRUE
)As we did for the IGHV factor we can also inspect the molecular signatures in the input data with the functions plot_data_scatter and plot_data_heatmap:
plot_data_scatter(MOFAobject,
view = "Drugs",
factor = 2,
features = 4,
sign = "positive",
color_by = "trisomy12"
) + labs(y="Drug response (cell viability)")plot_data_heatmap(MOFAobject,
view = "mRNA",
factor = 2,
features = 25,
denoise = TRUE,
cluster_rows = TRUE, cluster_cols = FALSE,
show_rownames = TRUE, show_colnames = FALSE,
scale = "row"
)(Q) Why is the scatter plot of Factor 1 vs Factor 2 important for personalised medicine?
Because it enables us to classify samples based on their molecular (i.e. phenotypic) profile. Both Factor 1 and Factor 2 are associated to somatic mutations which have significant transcriptomic and epigenetic consequences which are in turn linked to a clinically-relevant impact on the response to different drugs (ex vivo). This is precisely the aim of personalised medicine! Obviously more research is needed, but the patients from the four subgroups are likely to require different drugs for an optimal treatment.
(Q) Characterise Factor 4 based on the GSEA results?
Plotting the GSEA results for Factor 4 reveals that this Factor is capturing differences in the stress response of the blood cells. We have significantly enriched pathways such as cellular response to stress or heat shock response.
plot_enrichment(enrichment.results, factor = 4, max.pathways = 15)The top genes that are driving this pathway are Heat Shock Proteins and some inflammatory markers such as TNF.
plot_top_weights(MOFAobject,
view = "mRNA",
factor = 4,
nfeatures = 15
)It looks like these genes have a positive weight, which means that they have higher levels in the samples with positive Factor 4 values:
plot_data_scatter(MOFAobject,
factor = 4,
view = "mRNA",
features = 6,
add_lm = TRUE
) + labs(y="RNA expression")(Q) Which drugs are associated with Factor 4? What is their target pathway? Do they make biological sense?
plot_top_weights(MOFAobject,
view = "Drugs",
factor = 4,
nfeatures = 15
)plot_data_scatter(MOFAobject,
view = "Drugs",
factor = 4,
features = 6,
add_lm = TRUE
) + labs(y="Viability")Out of the 5 top drugs, the target category for 3 of them is Reactive Oxygen Species, which are closely related to stress response mechanisms.
drug_metadata[grep("SD51|BAY|SD07|MIS-43|NU7441",drug_metadata$name),] drug_id name main_targets target_category
1: D_032 NU7441 DNAPK DNA damage response
2: D_041 BAY 11-7085 NFkB NFkB
3: D_127 SD07 ROS Reactive oxygen species
4: D_141 SD51 ROS Reactive oxygen species
5: D_149 MIS-43 ROS Reactive oxygen species
pathway
1: Other
2: Other
3: Reactive oxygen species
4: Reactive oxygen species
5: Reactive oxygen species
(Q) Which Factors are associated with the clinical covariate (time to next treatment)
Factor 1 (as expected), Factor 4, 8, 9 and 11 are all statistically associated with time to next treatment.
Session info
sessionInfo()R version 4.2.0 (2022-04-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur/Monterey 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] survminer_0.4.9 ggpubr_0.4.0 survival_3.4-0
[4] randomForest_4.7-1.1 MOFA2_1.6.0 forcats_0.5.1
[7] stringr_1.4.1 dplyr_1.0.9 purrr_0.3.4
[10] readr_2.1.2 tidyr_1.2.0 tibble_3.1.8
[13] tidyverse_1.3.2 ggplot2_3.4.1 data.table_1.14.8
loaded via a namespace (and not attached):
[1] googledrive_2.0.0 Rtsne_0.16 colorspace_2.0-3
[4] ggsignif_0.6.3 ellipsis_0.3.2 rprojroot_2.0.3
[7] fs_1.5.2 rstudioapi_0.13 farver_2.1.1
[10] ggrepel_0.9.1 fansi_1.0.3 lubridate_1.8.0
[13] xml2_1.3.3 splines_4.2.0 codetools_0.2-18
[16] R.methodsS3_1.8.2 mnormt_2.1.0 knitr_1.39
[19] jsonlite_1.8.3 km.ci_0.5-6 broom_1.0.0
[22] dbplyr_2.2.1 png_0.1-7 R.oo_1.25.0
[25] pheatmap_1.0.12 uwot_0.1.11 HDF5Array_1.24.1
[28] compiler_4.2.0 httr_1.4.3 basilisk_1.8.0
[31] backports_1.4.1 assertthat_0.2.1 Matrix_1.5-4
[34] fastmap_1.1.0 gargle_1.2.0 cli_3.4.1
[37] htmltools_0.5.4 tools_4.2.0 gtable_0.3.0
[40] glue_1.6.2 reshape2_1.4.4 Rcpp_1.0.9
[43] carData_3.0-5 cellranger_1.1.0 vctrs_0.5.2
[46] rhdf5filters_1.8.0 nlme_3.1-158 psych_2.2.5
[49] xfun_0.31 rvest_1.0.2 lifecycle_1.0.3
[52] rstatix_0.7.0 googlesheets4_1.0.0 zoo_1.8-10
[55] scales_1.2.0 basilisk.utils_1.8.0 hms_1.1.1
[58] MatrixGenerics_1.8.1 parallel_4.2.0 rhdf5_2.40.0
[61] RColorBrewer_1.1-3 yaml_2.3.5 gridExtra_2.3
[64] reticulate_1.25 KMsurv_0.1-5 stringi_1.7.8
[67] highr_0.9 S4Vectors_0.34.0 corrplot_0.92
[70] BiocGenerics_0.42.0 filelock_1.0.2 rlang_1.0.6
[73] pkgconfig_2.0.3 matrixStats_0.62.0 evaluate_0.15
[76] lattice_0.20-45 Rhdf5lib_1.18.2 htmlwidgets_1.5.4
[79] labeling_0.4.2 cowplot_1.1.1 tidyselect_1.1.2
[82] here_1.0.1 plyr_1.8.7 magrittr_2.0.3
[85] R6_2.5.1 IRanges_2.30.0 generics_0.1.3
[88] DelayedArray_0.22.0 DBI_1.1.3 mgcv_1.8-40
[91] pillar_1.8.0 haven_2.5.0 withr_2.5.0
[94] abind_1.4-5 dir.expiry_1.4.0 modelr_0.1.8
[97] crayon_1.5.2 car_3.1-0 survMisc_0.5.6
[100] utf8_1.2.2 tzdb_0.3.0 rmarkdown_2.14
[103] grid_4.2.0 readxl_1.4.0 reprex_2.0.1
[106] digest_0.6.30 xtable_1.8-4 R.utils_2.12.0
[109] stats4_4.2.0 munsell_0.5.0